safety protocol
Optimizing AI Agent Attacks With Synthetic Data
Loughridge, Chloe, Colognese, Paul, Griffin, Avery, Tracy, Tyler, Kutasov, Jon, Benton, Joe
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts
Schoene, Annika M, Canca, Cansu
Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.
Humanoid robot malfunctions, sparks viral panic
Kurt Knutsson talks about a viral video that shows a humanoid robot going wild. A chilling video circulating on social media has reignited old anxieties about robots turning against their creators. The footage shows a Unitree H1 humanoid robot, a machine about the size of an adult human, suddenly flailing its arms and legs with alarming force during a test, coming dangerously close to two technicians. The scene has sparked heated debate about the safety of advanced robotics and artificial intelligence. But is this truly the beginning of something out of our worst fears, or is there just a straightforward technical explanation for what happened?
Jailbreaking is (Mostly) Simpler Than You Think
Russinovich, Mark, Salem, Ahmed
The rapid advancement of artificial intelligence has coincided with increasing concerns regarding the safe and ethical deployment of these systems. As AI models become more capable, ensuring that their behavior aligns with societal norms and safety standards has emerged as a critical research challenge. State-of-the-art alignment techniques--such as reinforcement learning from human feedback and rule-based fine-tuning--strive to constrain models to acceptable ethical behaviors. However, these methods face an inherent tension: while alignment is designed to prevent the disclosure of harmful or sensitive information, adversaries can leverage the gap between a model's potential and its restricted behavior through what is known as a jailbreak. In the context of AI, a jailbreak is any method that circumvents established safety protocols, effectively enabling functionalities that the system would otherwise suppress. Current jailbreaks typically deploy elaborate prompt constructions or optimization strategies; in contrast, in this paper we present the Context Compliance Attack (CCA), a simple optimization-free jailbreak. CCA leverages a basic yet critical design flaw--the reliance on client-supplied conversation history--to subvert the AI systems' safeguards and jailbreak them. This paper investigates the efficacy of CCA and explores its implications on current AI safety architectures.
Coal Mining Question Answering with LLMs
Rivera, Antonio Carlos, Moore, Anthony, Robinson, Steven
In this paper, we present a novel approach to coal mining question answering (QA) using large language models (LLMs) combined with tailored prompt engineering techniques. Coal mining is a complex, high-risk industry where accurate, context-aware information is critical for safe and efficient operations. Current QA systems struggle to handle the technical and dynamic nature of mining-related queries. To address these challenges, we propose a multi-turn prompt engineering framework designed to guide LLMs, such as GPT-4, in answering coal mining questions with higher precision and relevance. By breaking down complex queries into structured components, our approach allows LLMs to process nuanced technical information more effectively. We manually curated a dataset of 500 questions from real-world mining scenarios and evaluated the system's performance using both accuracy (ACC) and GPT-4-based scoring metrics. Experiments comparing ChatGPT, Claude2, and GPT-4 across baseline, chain-of-thought (CoT), and multi-turn prompting methods demonstrate that our method significantly improves both accuracy and contextual relevance, with an average accuracy improvement of 15-18% and a notable increase in GPT-4 scores. The results show that our prompt-engineering approach provides a robust, adaptable solution for domain-specific question answering in high-stakes environments like coal mining.
OpenAI CEO Sam Altman says Elon Musk-backed letter calling for AI pause wasn't 'optimal way to address it'
Twitter and Tesla CEO Elon Musk weighs in on the dangers of artificial intelligence, the future of Twitter and more in an exclusive 'Tucker Carlson Tonight' interview. OpenAI CEO Sam Altman says that a letter signed by Twitter CEO Elon Musk and others in the technology community calling for a pause on "giant AI experiments" wasn't the right way to address the issue. Musk, Steve Wozniak, and other tech leaders signed the letter in March, which asked AI developers to "immediately pause for at least 6 months the training of AI systems more powerful than GPT-4." During a virtual appearance at the Massachusetts Institute of Technology on Thursday, Altman addressed the letter. "There's parts of the thrust that I really agree with," Altman said, adding that his team spent more than six months after completing the training of ChatGPT 4 to study safety components before it was released.
Elon Musk's Urgent Warning, Demands Pause on AI Research
Elon Musk, the owner of Tesla Motors, has joined forces with several other industry experts to pen an open letter calling for a pause on the further development of AI tools like OpenAI's newly launched GPT-4. The letter cites potential "risks to society and humanity" as the primary reason for this request. By uniting prominent figures like Elon Musk and leading experts in the AI field, the letter emphasizes the need for responsible development and collaboration among industry stakeholders. The open letter, signed by Musk and other industry heavyweights, stresses the importance of ensuring that powerful AI systems are developed only once we have confidence in their positive effects and manageable risks. Tesla, which uses AI for its autopilot system, showcases Musk's own involvement in the field and the gravity of his concerns.
Elon Musk, Apple co-founder, other tech experts call for pause on 'giant AI experiments': 'Dangerous race'
As more companies rush to implement AI solutions and software, a growing number of experts are warning that it could result in an explosion of 'fake news' and misinformation. Elon Musk, Steve Wozniak, and a host of other tech leaders and artificial intelligence experts are urging AI labs to pause development of powerful new AI systems in an open letter citing potential risks to society. The letter asks AI developers to "immediately pause for at least 6 months the training of AI systems more powerful than GPT-4." It was issued by the Future of Life Institute and signed by more than 1,000 people, including Musk, who argued that safety protocols need to be developed by independent overseers to guide the future of AI systems. GPT-4 is the latest deep learning model from OpenAI, which "exhibits human-level performance on various professional and academic benchmarks," according to the lab.
Could Your Robot be Spying on You?
Researchers led by the National University of Singapore recently demonstrated that household robot vacuum cleaners can be hacked to behave like listening devices, which spy on their unsuspecting owners. Could industrial robots be similarly compromised? Hackers have exploited LiDAR (Light Detection And Ranging) scanner technology, as used in the latest iPhone, to turn a household vacuum cleaner into a spying device. If that's not sinister enough, elsewhere, in an experimental stunt, a friendly-looking humanoid robot was hacked to act like Chucky, the evil killer doll from the Child's Play movies. A video shows the robot attacking a tomato while emitting an evil laugh. The latter experiment was designed to demonstrate the vulnerabilities of technologies that can be hacked in an increasingly connected world.
How Conversational AI is Changing the U.S. Healthcare Industry in the Post-Pandemic Era – 3 Best Practices You Should Know
Technology has changed, and so have our buying habits. Today, we find advancements that can lighten our day-to-day tasks and save us time and effort. However, in the race to excel, we neglect our health and then demand healthcare services faster than they can be delivered. The global COVID-19 pandemic is one of the clearest examples of how humans can be affected by a single virus. Those with strong immunity overcome the impact of COVID and hence survive.